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Abstract 

Motivated by recent experimental developments in functional genomics, we construct and 
test a numerical technique for inferring process pathways, in which one process calls another 
process, from time series data. We validate using a case in which data are readily available and 
formulate an extension, appropriate for genetic regulatory networks, which exploits Bayesian 
inference and in which the present-day undersampling is compensated for by prior understanding 
of genetic regulation. 

Preprint number: NSF-ITP-02-47 

1 Motivation 

The last decade has witnessed stunning advances in experimental biology, particularly in the fields 
of neuroscience and genomics, which have made possible 'data-driven' biological investigations. As 
examples, the quantitative revolution of genomics has provided terabytes of transcriptome data; and 
neuroscientists routinely record for hours or even days from multiple neurons simultaneously. This 
transformation stands as a challenge to theorists who hope to advance understanding by making 
connection between experiment and first principles models .[] 

In genomics, for example, we are presented with the expression levels of thousands of genes, 
but our ability to model is limited not only quantitatively, in that there are myriad unknown rate 
constants and binding parameters, but qualitatively, in that a sizable fraction of proteins and genes 
remain of uncharacterized function (!]]. Similarly, in neuroscience, we can model patches of cellular 
membranes, synapses, and (at least electro-physiologically) entire cells However, this modeling 
hinges on numerous unknown parameters, and even if we can perform massive computations involved 
in the study of even rather small biological neural networks, the sensitivity to these parameters still 
makes the whole approach intractable. The astronomical amounts of experimental data are troubling 
computationally, but even more immediate problems are the lack of reductive descriptions of the 
underlying phenomena and undersampling — an inability of the data to determine the (slightly 
smaller) astronomical number of important microscopic parameters appearing in theoretical models. 

1 The mathematization of such models are referred to below as the 'microscopic equations'; consider for example 
those of fluid dynamics which govern, yet certainly fail to encapsulate, such phenomena as turbulence and the tumbling 
of a falling leaf. 
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Presented with such an imbalance, it is important to distinguish among the possible questions we 
can ask as well as the possible tools at our disposal for answering them. That is, one can ask what the 
system is doing (a nontrivial question when the language for discovery is an astronomical number of 
unorganized data) before one asks how it is doing what it does. The latter involves building models 
of some microscopic fidelity. The former may be answered without reference to microscopies by a 
model-independent, data-driven phenomenological approach. 

A useful historical analogy is that of particle physics of the late 1950's, in which an explosion of 
data from accelerators was equally daunting and similarly irreducible. At that time physicists were 
not yet asking the how questions (cross sections, isospin multiplets, etc.) but were instead carefully, 
statistically, inferring the presence or absence of features in the data; for example, exploiting prior 
knowledge of quantum mechanics to constrain reasonable shapes for peaks in the data (hallmarks 
of newly discovered particles — the 'resonances'). 

In this analogy, neuroscience is still dealing with the existence of peaks. Indeed, only recently 
(see Ref. B and further works by the same authors) it has become clear that precise timings of 
single spikes are very important for understanding the neural code. This is a basic, objective, 
model-independent observation, and it is not surprising that Shannon's information theory, which 
was specifically designed with these types of questions in mind, turned out to be extremely useful. 

In genomics, however, the quantities of interest are easier to identify: gene expressions are largely 
governed by the underlying regulation networks. Now is the time to attempt to infer these networks 
— still the what question — which corresponds to inferring the peaks in the data. This is a requisite 
step before classifying the possible networks and explaining the classification rules — an answer 
to how the system does what it does. Trying to answer how before carefully exploring what might 
ultimately produce many epi-manipulations of the data, but little significant understanding. 

Said otherwise, presented with data describing natural phenomena, one should form a phe- 
nomenology of experimental results, then inferences from the data in light of this phenomenology, 
and finally microscopic models. Genomics is currently at the penultimate step, and, armed with 
careful informatics, here meaning the incorporation of data with prior knowledge in the absence of 
detailed models, we hope to reduce the data into a representation which allows description, predic- 
tion, and ultimately control. 

An example of data reduction convenient for representation is cataloging of the regulatory net- 
works. However, such cataloging is not a model independent task: at the very least, our microscopic 
model includes the existence of the networks. Further, even if we are only interested in a network's 
connection diagram and do not care much about the exact details of the connections, identifica- 
tion of the network still involves determination of many parameters. Thus it is not clear that 
information-theoretic approaches will be of great use. However, it is plausible that our intuition of 
how the underlying microscopic dynamics translates into macroscopic probabilistic models may play 
a big role. The main purpose of this paper is to show that this intuition, appropriately mathema- 
tized in a principled way as the a priori knowledge — the priors in Bayesian statistics — may be 
self-consistently incorporated into a macroscopic probabilistic model of process pathways without 
detailed, sophisticated modeling of microscopic dynamics. We will first show this on a simple syn- 
thetic example, and then suggest some extensions of the ideas with an eye towards genetic regulatory 
networks. 



2 Functional genomics 

As mentioned, the motivating problem here is time series informatics applied to functional genomics.^] 
We therefore briefly review genetics and characterize the relevant experiments, the data from which 

2 It is important here to differentiate functional genomics, or 'post-genomics', from sequencing genomics. The 
latter is the set of techniques associated with obtaining the genetic sequence of an organism. The former is the set of 
techniques which try to put this information to use. 
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will be used in inferring the underlying connectivities and possibly control. 



2.1 A brief review of genetics 

The central goal of functional genomics is the understanding of the interactions among distinct 
parts of the genome — the genes. Each gene consists of long words composed of thousands of coding 
base pairs of DNA which are then transcribed into mRNA, which is then translated into protein. 
Many of these proteins, called transcription factors, then regulate the rate of production of mRNA 
transcribed either by their own genes or other genes. The working of all the genes thus forms a 
genetic regulatory network, and may be thought of as a dynamical system. Inputs include elements 
of the physical world which affect the activity of the transcription factors, and outputs may be 
considered as the concentrations of the translated proteins or, at a deeper level, the transcribed 
mRNA. While the proteins are ultimately responsible for cellular function, the mRNA are more 
easily experimentally measured via DNA microarrays. 



2.2 A brief review of DNA microarrays 

Only recently has it become possible to probe the expression of a number of genes comparable to the 
total number of genes in the entire genome of an organism via microarrays of nucleic acids, commonly 
known as 'DNA chips.' The most common application of such a chip is to monitor simultaneously 
the expression of thousands of genes by detecting hybridization of nucleic acid originating from a 
biological sample to target nucleic acids lying on the chip. One can then probe, for example, the 
differences in gene expression between cancerous and non-cancerous cells of the same specialization 
[ \ H p , 0, ||, H |l(| , or the expression of different genes as a function of the phase of the cell cycle 
[12|, or of the response of cells to chemical or physical perturbation. The two latter types of 
experiments produce time series of gene expressions, and they will be the focal point of our discussion 
from now on. 

The first DNA microarrays were made by Affymetrix in the early 1990s [l4[ ^5). In this 
technique, DNA oligonucleotides are attached to a surface in a specific spatial pattern, directed by 
optically activated chemical synthesis. One can build an arbitrary oligonucleotide sequence in a 
small area (approximately 20 microns per target) on the surface. However, the initial setup cost 
of creating the chip makes the technique infeasible for any application for which less than several 
hundred masks will be created (and sold). Individual researchers are completely without flexibility 
to change the chip to fit a particular area of investigation. 

Functional genomics further benefited from a second technology in 1996, when Pat Brown's lab 
at Stanford introduced the spot chip Jl6|, [l7|. This highly customizable technique exploits robotic 
deposition of drops on a microscope slide. The automation makes creating new and different slides a 
simple operation. Moreover, one can create typically 120 slides at a time. Individual researchers can 
thus design custom experiments, placing genes at locations or redundancies of their choosing. The 
gene fragments used in the spotting technique are hundreds of base pairs in length, and therefore 
less sensitive to single base pair mismatches. 



3 Methods 

3.1 Chemical network reconstruction 

Genomic data is certainly not the first dynamic data for which reverse engineering of the interaction 
network has been attempted. A similar problem has been faced historically in chemistry, in which 
one would like to infer the underlying reactions responsible for observed data. Such reaction networks 
typically are sparse, that is, the typical connectivity is far less than the total number of chemical 
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species. This is also true in genomics, where one gene typically interacts with no more than a few 
dozen others Jig] . 

To highlight the parallels, one may state the question as follows: armed with sufficient temporal 
data taken from a number of interacting reagents (here, chemicals), is it possible to infer the circuit 
diagram? One possible strategy was proposed in 1995 ]l9| ], tested first on simulated data, and later 
on the glycolytic pathway ]2C| , and recently refined in light of ideas from information theory pl| . 
However, this strategy has yet to be successfully applied to any reactions which were not known 
by the authors beforehand, nor subjected to a 'blind' test, as in the annual CASP test among the 
protein folding community.^] In addition, unlike in chemical kinetics, where the data is produced 
by moderately nonlinear and rapidly interrogated dynamical systems, genomic datasets are highly 
undersampled and are more like a set of fuzzy logic gates, or leaky boolean circuits. Thus, successful 
application of techniques inspired by chemical networks to genomic data is, at best, doubtful. 

We may nonetheless attempt reverse engineering of regulatory networks with a similar philosophy, 
in that, rather than trying to fit to a precise microscopic model, we attempt to parameterize a 
minimal phenomenological model and infer macroscopic parameters from it. 



3.2 Synthetic network reconstruction 

In order to test any new attempt to infer connectivity from dynamics, it is useful to study a system 
which is qualitatively similar, e.g., which demonstrates degrees of freedom which turn on and off 
other degrees of freedom in a near-complete or 'fuzzy logic gate' way and for which connectivity is 
sparse; yet for which data are readily available and obvious to interpret. To that end, we collected 
data from a multiuser UNIX machine, recording the relative CPU usage of all processes (the analogue 
of mRNA concentration), user ID, and process name, as a function of time. This can be done in an 
automated way via bash script ,Q and the results analyzed via MATLAB. 

A typical time course, automatically labeled via MATLAB, is shown in Fig. |l| for a particular 
(anonymous) user at a large department of applied mathematics. One axis is the job 'number,' 
ordered by frequency of occurrence over all users; the other axis is time (roughly in seconds). The 
height (and color) indicates CPU usage. 



3.3 Modeling synthetic data 

We begin with the minimal probabilistic model of process pathways, in which the strength of each 
process at subsequent time steps is linearly determined by the strength of all current processes. 
Similar linear models have been used with some success in understanding genomic data, including 
clustering via dynamics p2| . The most general model in discrete time is the AR(p) model, in which 
we include the possibility that the state now is a function of the p previous observations of the 
system. We do not yet include the possibility of hidden degrees of freedom. Mathematically, we 
may pose the model as 

g t = w + Mig t _ 1 +Magt_2 + ... + Mpg t -p + £ t (1) 
(£i,t&t'> - C«6 t , t ,, (2) 

where the degree of freedom at observation t is gt, the transition matrices are Mj, and the noise 
correlation C is as-yet undetermined. 

One may fit for the most probable transition matrices M.j as well as the offset Wo and the noise 
correlation matrix C, using, for example, the standard Schwartz's Bayesian Information Criterion 
p3[ to determine the most likely value of p.^ Excellent numerical techniques and general purpose 
libraries have been designed for solving this problem p4j . 



^Critical Assessment of techniques for protein Structure Prediction; http://predictioncenter.llnl.gov/ 

4 (GNU) Bourne-Again SHell 

5 See Sec. |5.l| for a brief discussion of this model selection technique. 
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Figure 1: CPU percentage for all the processes called by a particular user during the observation 
(approximately 10 3 seconds). Processes are numbered according to their relative frequency over all 
users during the observation time. 

Note that we do not claim the actual interactions among the processes are linear, as Eq. ([[]) seems 
to imply. Indeed, as stated above, the exact values of the transition matrices Mj are of very little 
interest to us, and we are only interested in the topological features of the network. It is reasonable 
to expect that the absence of a connection between two processes will be fit well by a zero in the 
corresponding transition matrix element, while the presence of a connection of any type will result 
in its nonzero value. The mismatch between the linear form of Eq. (|l|) and the actual dynamics will 
manifest itself in a large variance C u . However, if we are not interested in the exact values of Mj, 
this should not adversely effect our determination of connections. 



4 Results 

Fitting the observed CPU usages to the transition state model, one finds the most probable p value 
is 1, indicating a lack of inertia in the system; the resulting transition matrix Mi is plotted in Fig. |^. 
The noise covariance matrix was quite small, despite the naivete of the model. 

Causal connections between jobs are labeled with '— ►', e.g., 'emacs^latex' or 'emacs drives 
latex.' We highlight several remarkable features: 

1. Processes familiar to anyone who has used the typesetting software I^Tj^Xwill be readily ap- 
parent: one edits a file (e.g., in emacs, xemacs, or vi), then compiles with latex, and views 
the result in xdvi and finally ghostscript ('gs'). Similarly one observes emacs drives latex, 
latex drives gs, etc. 
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2. Note that the matrix is not symmetric: one axis describes processes which 'activate' other pro- 
cesses; the second describes which process is acted on. For example, latex drives ghostscript 
but ghostscript does not drive latex. 

3. The transition matrix shows 'upregulation' as well as 'downregulation': some processes dis- 
courage other processes at later times. 

4. Diagonal elements have not been labeled as they simply describe the likelihood the process 
will continue on to the next time step. 

5. Note also how the transition matrix correctly infers the highly sparse connectivity of these 
disparate jobs. The vast majority of elements are 0, as they should be, since processes are not 
influenced by those called by other users. 

For comparison, in Fig. ^| we also show an example of a transition matrix when the CPU usages 
are replaced with randomly generated data. Any reasonable structure is absent here. These results 
are in accord with our intuition that the proposed probabilistic model for data reduction, Eq. (^), 
although an incomplete description, still leads to a reasonable reconstruction of network connectivity. 



5 Modeling gene regulation 

In the example above, as mentioned, data are plentiful. In genomic time series, data are scarce. 
However, the above exercise is designed to test a phenomenological model into which one can incor- 
porate additional knowledge about genetic regulation. This goal, constraining possible models by 
compensating for sparse data with prior knowledge, is mathcmatizcd via Bayesian analysis. 
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Figure 3: The transition state matrix Mi resulting from random data. 

We emphasize that we are not using Bayesian analysis to attempt to fit for the innumerable 
unknowns in a microscopic model, e.g., a chemical kinetics model of transcriptional regulation. 
We are not interested in these parameters but in couplings and network topology. We instead 
augment the successful phenomenological or 'macroscopic' model above with prior knowledge about 
transcriptional regulation. 



5.1 Bayesian statistics 

A brief summary of Bayesian statistics is in order. We refer the interested reader to standard 
textbooks (cf. Ref. [p5[ ) for discussions of philosophical implications of Bayesian statistics, as well as 
standard statistical properties of Bayesian estimators. We focus below only on the relevant features. 
Bayes rule itself, 



P(b\a) = 



P(a\b)P(b) 



(3) 



is merely a rewriting of the rules of joint probabilities. The connection with interpretation of 
experiments is made by identifying b as the model, or specifically the vector of parameters 9 of the 
model; and a as the data D. Then 



P(9\D) 



P(D\9)P{9) 



P(D) 



P(D) ' 
d9P{9,D) = { d9P(D\9)P(0). 



(4) 
(5) 



In Eq. Q4p , the left hand side is called the posterior, and the first term in the numerator in the 
right hand side is called the likelihood. The strength of Bayesian methods comes from P(9), the 
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prior. The prior summarizes knowledge about the probability of a model before the the results, 
D, of an experiment are observed. When experimental data are abundant the exact specification 



of the prior is usually unimportant (cf. Refs. [ {26| , 27 ); when data are scarce the prior constrains 
the space of available models. While careless a priori assumptions may constrain the observer to 
the wrong part of the model space, in the analysis of genetic regulatory networks one may exploit 
well established knowledge about transcriptional regulation to construct appropriate priors, as we 



illustrate in Sec. 5.2.2 



Armed with the prior, one next finds the a posteriori expected value of the parameters: 



(8) D = Jd88P{8\D) = 



j d88 P{8)P(D\8) _ (8P{D\6)) 



P{D) 



(P(D\0)) 



(6) 



where (...) and (. . .}n denote expectations over the prior and the posterior respectively. When the 
number of data A is large we expect the posterior to be tightly peaked around some value 8 which 
maximizes the posterior (maximum a posteriori probability, or MAP, values), and then 8 is the first 
order term in the saddle point asymptotic expansion of (9)d in powers of X/N. 

Even for severely undersampled problems, A is usually large enough so that 8 w (8)d, arid it is 
tempting to replace integrals in Eq. (||) by their saddle point values. One of the greatest realizations 
in Bayesian theory in the last decades was that such replacement is wrong ^8|, |2(| ^o) . Indeed, 
when averaging over all possible models, each of the integrals in Eq. (k) will have a contribution 
from fluctuations around the value 8. For example, under some very general conditions, and with 
an assumption that log P(D\8) scales linearly in the number of data, the total probability of the 
data, P(D), has the following expansion in powers of 1/A 



log P{D)= log P{D\8) — — log A — i log det 



d 2 



d8,d8i 



log P(D\8) 



A 



log P(§) + o( A ) , (7) 



where K is the number of parameters in the model (dimensionality of 8). The integral in the 
numerator of Eq. (|^) can be written in a similar fashion. We see that the terms beyond the maximum 
likelihood contribution log P(D\8) are generally negative and their magnitude grows with K. Thus 
these terms provide a built-in punishment for model complexity. For this reason, they are known 
in the literature as as the Occam razor (cf. Ref. p0||). 

To illustrate, imagine that the prior admits two model families O 1 and O 2 with different param- 
eters 1 , 8 2 , such that K\ = dim© 1 < K2 = dim© 2 . As an example, consider fitting a function 
with a polynomial of low degree {K±) or high degree (A2). Usually we would expect the model 

family with more parameters to be better at explaining the data: P(D\8 2 ) > P(D\8 1 ). Thus if we 
were to choose a model family that explains the data best based on the maximum likelihood alone, 
a more complex model would win. However, the estimates within the larger model family, O 2 , are 
less robust to small fluctuations, and this is picked up by the integration over all parameters: even 

though P{D\8 r ) may be smaller than P(D\8 2 ), the relation between the probabilities of the model 
families P(Q 1 \D) and P(0 2 |D), as determined by Eqs. (|| 0), may be different. In particular, for 
A ^> 1 the likelihood term, which scales linearly with A, always wins, and Bayesian model selec- 
tion approaches that of the maximum likelihood. However, for small A the difference between the 
likelihood and the other terms in Eq. (Q) is less profound, and a simpler model family, which is not 
the best in explaining the data, may turn out victorious. 

In short, Bayes rule shows how a simpler model may be less likely, yet more probable. 

Before ending our quick review of Bayesian statistics, two more comments are in order. First, 
as the model selection arguments are mostly important for small A, where log A ~ 1, it would be a 
mistake to ignore O(l) terms in Eq. (]?]) and use just A/2 log A as a model complexity punishment 
(Schwartz's Bayesian Information Criterion p3| , also mentioned in Sec. 3.3). In particular, we 
believe that such replacement may be a cause of a common observation that the Bayesian Criterion 
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overpunishes complex models (cf. Ref. |31|). Second, even though in this work we will be mostly 
dealing with finite parameter models, application of nonparameteric methods to biological data 
certainly holds promise. Occam-type arguments for such cases have been discussed in, for example, 
Refs. [[§[§. 



5.2 Bayesian inference of regulatory dynamics 
5.2.1 A simple model 

Let g t stand for the vector of expressions (mRNA concentrations) of genes in a microarray experiment 
at time t. The number of genes, K s = dimg t , can be on the order of thousands. In principle, gt+i 
can depend on concentrations at all times that preceded t + 1. However, if we view the gene 
expression mechanism in cells as a realization of some chemical kinetics, governed by first order 
differential equations, then it is reasonable to expect that the concentrations depend only on their 
immediate past, i.e., gt+i = i(gt)^\ We begin with the simplest possible dynamic, a simplified 
version of Eqs. (|l|, ||): 



gt+i - gt 



o 2 Sij5 t ,t' , 



where the noise is Gaussian, and M and a are unknown. This is equivalent to 



(8) 
(9) 



P(g t+ l|g t ,M,<7) 



1 



(27R7 2 ) 



exp 



gt -Mg t \ 2 ) f 



(10) 



where (• ■ -)t indicates empirical averaging over time. 

Notice that unlike in Eqs. ([j], ||) the noise in this model is not correlated among the genes; 
moreover, variances are equal for all genes. Below we formulate an extension that incorporates hidden 
degrees of freedom (e.g., biochemistry) whose omission may otherwise lead to large or correlated 
noise (cf. Sec. 5.2.3). However, with only a handful of experiments available we cannot hope that 
data will be able to determine millions of elements of a full covariancc matrix. As data become more 



plentiful, it may even make sense to bypass Eqs. (£ 
feature extraction (as formulated in Sees. 6.1 and 



current simple model exhibits some of the success evidenced in Sec. |. 



|) completely and pursue model independent 
Below we pursue the possibility that the 



5.2.2 Biological priors 



Even in the simplistic form of Eqs. (|8|, |9J), the dynamic still contains too many parameters (~ Kg) to 
be tractable. We therefore search for biologically motivated priors to constrain the possible values of 
M. An example of such a prior would be, for example, that genes with similar regulatory sequences 
should be regulated similarly. More directly, genes whose promoters have similar numbers of certain 
important motifs should be co-expressed, an ansatz used with notable success by Bussemaker et 
al. |m| in discovering regulatory regions. 

This may be expressed in the following prior 



P{M\fi,£) ocexp 



' 2 ^ 



M. 



jk 



2 



(ii) 



6 Support for this choice is strong, as such models have been used with great success in the design of small genetic 
networks (see, e.g., Ref. |33| for a review). Further, in a study clustering genes by their dynamics, Ramoni et al. tested 



higher order models, but Tfound that the first order dynamics gave the most probable result [G 
in Sec. 1 1 above the first order model turned out the most probable, as well. 



Finally, recall that 
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where the first term punishes for different regulation of genes with similar regulating sequences, and 
the second assures proper normalization of priors by effectively constraining the range of possible 
Mij |^| Here dij is a distance function measuring deviation between regulatory regions of genes i and 
j in terms of the number of each regulatory motif appearing, weighted by the relative importance 
of that motif, found by considering the entire time series as in Ref. j34j or 'quality factors' as in 
Ref. |5). 

The parameter £ plays the role of a smoothing length, as in Ref. |3^ |, and, lacking a first- 
principles estimate of its value, we must integrate over £, weighted by an appropriate prior. As 
explained in Ref. |27|| , it is not only likely that such integration will choose the proper value of 
I almost independently of such prior, but it may even balance a slightly improper choice of the 
distance measure and the difference form [M ik — Mj k ) 2 . The same comments relate to the mass 
\x and the noise variance a as well. 

An enjoyable feature of this prior is that, like the likelihood, Eq. (|l0|), it is exponentially quadratic 
in the unknowns M. Thus the posterior expectations are Gaussian integrals, which may be performed 
analytically using the standard Wigner current technique Pq| . Following Eq. (^) we find for the a 
posteriori values of the connection matrix: 



Z{3) 



d 



dJ, 



logZ(JF) , 



j=o 

dadfj,d£P(a)P(fj,)P{£) 

;b + j) (A + gy 1 



dct A 



det{A + G) 



1/2 



x exp 



Here the curvature tensor A at the saddle point of Eq. (|11]) is given by 

A' 



.4 



ij,kl 



S ik Sji - 2£ 2 c ik Sji 



(12) 



(13) 



(14) 



and the time-lagged correlation, equal-time correlation, and 'closeness' (the inverse of the distance 
matrix) matrices are 



^<(gt+i-gi)g t T > t 

i( gtgt T ) f , g = n 



o 



if * ^3, 
if i = j. 



(15) 
(16) 

(17) 



The mass /i regulates the integrals and is expected to be small. Then dct ^4/ dct(A + G) scales as 
a large positive power of £ 2 /(£ 2 + const), and therefore decreases as £ decreases. On the other hand, 
the exponent in Eq. ( p"3| ) involves {A + g)~ 1 , which scales as l/(^ 2 + const) for large £. The exponent 
thus decreases as £ increases. We may then reasonably expect that the integrand in Eq. (|l^) will be 
peaked at some non-trivial value of £\ this peak should be sharp since both B and g involve large 



number of samples. This may be viewed as the smoothing length selection by the data 27 . 

Note that the priors over the hyperparameters cr, /x, £ in Eq. ( p"3| ) are as yet undefined. Since, 
as mentioned above (see also Refs. pq|), their actual forms are of little importance, we may hope 
that by choosing them appropriately we may be able to render the integrals in Eq. (|l3|) analytically 
tractable. 



7 Physicists will recognize these as the 'kinetic' and 'mass' terms in a Lagrangian, respectively. 
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5.2.3 Hidden control 



The formalism deserves experimental testing. However, one further extension offers a substantial 
improvement. DNA microarrays offer only a partial view into the workings of a cell, since numer- 
ous important degrees of freedom remain unobserved. In mathematical modeling of the yeast cell 
cycle, for example, considerable effort^] has been exerted to fine-tune models in which only chem- 
istry controls the processes, and the genetic expression is a mere passive function of this chemical 
control. While this may or may not prove to be an accurate characterization, some chemical control 
unobserved by gene chip experiments certainly exists. Inclusion will clearly necessitate a model of 
some structure other than that of Eq. (g). Unlike the assumption of linearity, which merely leads 
to misestimation of the values of interactions, this effect may make the dynamical system appear to 
be not of first order, or introduce a gene-dependent noise correlation matrix and artificial couplings 
between genes that dominate the real ones. To avoid this, we need to supplement the vector of genes 
g with a vector of unknown hidden degrees of freedom h. Then the evolution will take form 



gt+i 
ht+i 



Si 
h t 



(vW t >) 



irhh 



o 1 5ijSt,t' , 
Vij5 t ,t> ■ 



(18) 

(19) 
(20) 



Within Bayesian analysis, one may integrate over the unknown degrees of freedom and their 
possible couplings while remaining agnostic about the identities of h, and similarly sum over their 
possible dimensionality K^. As data are scarce, it is reasonable to expect that models with small 
number of hidden units will dominate the posterior. Thus we allow a full correlation matrix for 
the Gaussian noise in the hidden units, ry t , since this adds only a few additional parameters to our 
model (M sg , of course, has a few million), but allows necessary flexibility. 

One will need a prior over newly introduced M's. Since chemistry couples to expression via the 
transcription factors, and therefore via the regulatory sequences, we can write a similar prior as 
above, namely 



figh , £) oc exp 



EE 



M: 



ik 



M 



jk 



7 K 
fa 

2 



EEK 1 ) 2 



(21) 



with a different mass, but the same kinetic term as in Eq. ( |11| ) . In the absence of precise identification 
of the hidden degrees of freedom, we lack a biological prior on M hg , M hh and must therefore choose 
a prior that does not spoil the analytic tractability of the resulting integrals. 



6 Outlook 

The main difficulty in obtaining time series data is not biological or technological, but rather fi- 
nancial. Affymetrix chips, the more reliable of the two dominant technologies, are quite expensive. 
As estimated in p8[ , a 24-point time series with replication factor of 3 currently costs ~ $57,600. 
As the cost of the technology decreases, and as data become more plentiful, the role of priors in 
inferring connectivity and possible causation diminishes. 

Moreover, as noted before, with increasing data comes the possibility of model independent, 
nonparametric feature extraction by learning the joint probability distributions of expression levels 
at different times (cf. Ref. |3^]). We highlight two promising such directions below. 

8 See, e.g., Rcf. |37|] for a review. 
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6.1 Mutual information and entropy distance 



A completely model independent visualization tool of informatics is a low dimensional embedding 
of the connectivity diagram of the multiple genes via some meaningful metric. To this end, armed 
with a successfully learned joint probability one may use information theory to define distance in a 
principled way. 

An example of such a diagram based on information theoretic ideas is presented for simulated 
chemical kinetics in |2lf| , in which the mutual information 

/«->*) = / d 9i dg+P( 9i , 9 f)lo g2 ( p^pf^ j (22) 

was used for embedding, with the time lagged joint probability of degrees of freedom P(gi,g^) = 
{P{gi{t), gj(t + t)))j learned by histogramming. While mutual information is a useful similarity 
measure, it is not a distance in that it does not obey the triangle inequality. However, the 'entropy 
distance' 

D H (i - 3) = -J d 9i dgfP{g u g+) log 2 f ^g^^ j (23) 

does obey the triangle inequality J39[ and thus can be used as a metric, based on which one can 
form an embedding in lower dimensions and construct a process diagram as in Refs. Jl9| , pfl| , gjj . 
Here, by we mean the uncertainties in measurements of g{ij}, respectively. Note that this 

distance is reparameterization invariant under monotonic reparameterizations x — * f(x);y — > g(y), 
<J X -> crf/f', etc. 



6.2 Clustering by meaningful information: the information bottleneck 

Clustering without identifying a specific property of interest is meaningless. In most cases, if we try 
to learn from data (fit a curve, select a model, extrapolate, cluster, etc.) we are doing so not to find 
the parameters per se, but to use them to make predictions of future data pp| . Thus in dynamics 
the variables of relevance are the future gene expressions, and one should cluster data by maximally 
compressing them while retaining the most information about the subsequent time steps|] 

This idea was put on firm information-theoretic ground recently with the development of the 
information bottleneck ]4l|] , which, given a joint probability distribution between degrees of freedom 
(e.g., gene expressions at some time) and a quantity of interest (e.g., gene expressions at subse- 
quent times), allows an iterative calculation of the meaningful clusters in a probabilistic clustering 
algorithm. 

An important aspect omitted from current formulations of the information bottleneck is Bayesian 
integration over possible joint probability distributions; this procedure smoothes the data and avoids 
clustering noise. We expect that this will be one of the most promising as well as principled lines of 
research in bioinformatics. 



6.3 Prognosis 

The future is promising for such data-driven techniques: data are becoming more plentiful, com- 
putational power continues to exponentiate, and the data themselves are becoming more reliable, 
as those who hope to interpret them study more carefully their statistics and analyses.^ However, 

9 Lcl contrast, genes are usually clustered by similarity of their expression levels measured in some ad hoc metric 

±u bee, e.g., Ref. |4l M for one such careful analysis of Affymetrix data and Affymetrix's standard data analysis. 
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before any new techniques in computational biology will be widely exploited, they must be 'veri- 
fied' by comparing with results agreed upon in the biological community. In this case, verification 
will entail corroboration of inferred causal relations among genes (or within an inferred module of 
genes) with the biological literature. We anticipate that such time series based techniques will find 
common usage as tests for consensus with known connectivities become more standardized, and we 
look forward to their continued development and implementation. 
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